g03aaf
© Numerical Algorithms Group, 2002.
Purpose
G03AAF Performs principal component analysis
Synopsis
[e,p,v,s,ifail] = g03aaf(x<,matrix,isx,s,wt,std,weight,ifail>)
Description
Let $X$ be an $n$ by $p$ data matrix of $n$ observations on $p$ variables
$x_1,x_2,\ldots,x_p$ and let the $p$ by $p$ variance-covariance matrix of
$x_1,x_2,\ldots,x_p$ be $S$. A vector $a_1$ of length $p$ is found such that:
$a_1^{\mathrm{T}} S a_1$ is maximized subject to $a_1^{\mathrm{T}} a_1 = 1.0$.
The variable $z_1 = \sum_{i=1}^{p} a_{1i} x_i$ is known as the first principal
component and gives the linear combination of the variables that
gives the maximum variation. A second principal component,
$z_2 = \sum_{i=1}^{p} a_{2i} x_i$, is found such that:
$a_2^{\mathrm{T}} S a_2$ is maximized subject to $a_2^{\mathrm{T}} a_2 = 1.0$ and $a_2^{\mathrm{T}} a_1 = 0.0$.
This gives the linear combination of the variables, orthogonal to the
first principal component, that has the maximum variation. Further
principal components are derived in a similar way. The elements of the
vectors $a_i$ are known as the principal component loadings.
The vectors $a_1,a_2,\ldots,a_p$ are the eigenvectors of the matrix $S$,
and associated with each eigenvector is an eigenvalue $\lambda_i^2$.
The value of $\lambda_i^2 / \sum_i \lambda_i^2$ gives the proportion of
variation explained by the $i$th principal component. Alternatively,
the $a_i$ can be considered as the right singular vectors in a
singular value decomposition, with singular values $\lambda_i$, of
the data matrix centred about its mean and scaled by $1/\sqrt{n-1}$.
This latter approach is used in G03AAF.
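As a concrete illustration of this SVD-based approach, the following
minimal MATLAB sketch reproduces the quantities described above using
the built-in svd function rather than the NAG routine; the data and all
variable names are illustrative assumptions, not part of the G03AAF
interface:

    % Minimal sketch of the SVD approach described above (illustrative,
    % not the NAG implementation).
    x = rand(10, 3);                      % hypothetical n by p data matrix
    [n, p] = size(x);
    xc = x - repmat(mean(x), n, 1);       % centre about the mean
    [u, d, v] = svd(xc / sqrt(n - 1), 0); % economy-size SVD
    lambda = diag(d);                     % singular values lambda_i
    prop = lambda.^2 / sum(lambda.^2);    % proportion of variation explained
    scores = xc * v;                      % scores with variance = eigenvalue
    zstd = sqrt(n - 1) * u;               % scores standardised to variance 1.0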
Principal component analysis is often used to reduce the
dimension of a data set, replacing a large number of correlated
variables with a smaller number of orthogonal variables that
still contain most of the information in the original data set.
The choice of the number of dimensions required is usually based
on the amount of variation accounted for by the leading principal
components. If k principal components are selected then a test of
the equality of the remaining p-k eigenvalues is
$$\left(n - \frac{2p+5}{6}\right)\left\{-\sum_{i=k+1}^{p} \log(\lambda_i^2) + (p-k)\log\left(\sum_{i=k+1}^{p} \lambda_i^2 \Big/ (p-k)\right)\right\}$$
which has, asymptotically, a $\chi^2$ distribution with
$\frac{1}{2}(p-k-1)(p-k+2)$ degrees of freedom.
Equality of the remaining eigenvalues indicates that if any more
principal components are to be considered then they all should be
considered.
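A small MATLAB sketch of this test, written as a stand-alone function
(illustrative only; eigtest, lambda2 and the other names are assumptions,
not part of the NAG interface):

    function [stat, df] = eigtest(lambda2, k, n)
    % Illustrative sketch of the test of equality of the remaining
    % p-k eigenvalues; lambda2 holds the p eigenvalues lambda_i^2.
    p = numel(lambda2);
    rest = lambda2(k+1:p);                  % the remaining p-k eigenvalues
    stat = (n - (2*p + 5)/6) * ...
           (-sum(log(rest)) + (p - k)*log(sum(rest)/(p - k)));
    df = (p - k - 1)*(p - k + 2)/2;         % chi-squared degrees of freedom
    end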
Instead of the variance-covariance matrix, the correlation matrix,
the sums of squares and cross-products matrix or a standardised
sums of squares and cross-products matrix may be used. In the
last case $S$ is replaced by $\sigma^{-1/2} S \sigma^{-1/2}$ for a
diagonal matrix $\sigma$ with positive elements. If the
correlation matrix is used, the $\chi^2$ approximation for the
statistic given above is not valid.
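By way of comparison, a correlation-based analysis can be sketched in
plain MATLAB as an eigendecomposition of the correlation matrix
(illustrative only; in G03AAF the choice of matrix is made through the
optional matrix argument listed below):

    % Correlation-based principal component analysis (illustrative).
    x = rand(10, 3);                          % hypothetical data
    r = corrcoef(x);                          % correlation matrix in place of S
    [v, d] = eig(r);
    [evals, idx] = sort(diag(d), 'descend');  % eigenvalues, largest first
    loadings = v(:, idx);                     % loadings ordered to match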
The principal component scores are the values of the principal
component variables for the observations. These can be standardised
so that the variance of the scores for each principal component is
either 1.0 or equal to the corresponding eigenvalue. The principal
component scores correspond to the left-hand singular vectors.
Weights can be used with the analysis, in which case the matrix $X$
is first centred about the weighted means and each row is then scaled
by an amount $\sqrt{w_i}$, where $w_i$ is the weight for the $i$th
observation.
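A sketch of this weighted centring and scaling, again in plain MATLAB
(illustrative; the variable wt mirrors the optional argument of the same
name below, but the code itself is an assumption, not the NAG
implementation):

    % Sketch of the weighted centring and scaling described above
    % (illustrative only).
    x = rand(10, 3);  wt = rand(10, 1);       % hypothetical data and weights
    [n, p] = size(x);
    wmean = sum(repmat(wt, 1, p) .* x) / sum(wt);  % weighted column means
    xw = repmat(sqrt(wt), 1, p) .* ...
         (x - repmat(wmean, n, 1));           % row i scaled by sqrt(w_i)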
Parameters
Required Input Arguments:
x (:,:) real
Optional Input Arguments: <Default>
matrix (1) string 'v'
isx (:) integer ones(size(x,2),1)
s (:) real zeros(size(x,2),1)
wt (:) real zeros(size(x,1),1)
std (1) string 'u'
weight (1) string 'u'
ifail integer -1
Output Arguments:
e (:,6) real
p (:,:) real
v (:,:) real
s (:) real
ifail integer
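A minimal example call, relying on the defaults listed above (the data
are hypothetical and the NAG Toolbox for MATLAB must be available; see
the full parameter descriptions for the layout of each output):

    % Hypothetical example: PCA of the variance-covariance matrix with
    % unstandardised scores, using all default optional arguments.
    x = rand(10, 3);                  % 10 observations on 3 variables
    [e, p, v, s, ifail] = g03aaf(x);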